library(tidyverse)
library(lme4)
library(emmeans)
library(car) # for vif
library(bbmle) # for AICtab
library(sjPlot)
library(knitr)
theme_set(ggthemes::theme_few())
Mixed modeling with all relevant variables predicting accuracy
From the preregistration, the mixed model was specified thusly:
correct ~ delay * age +
task_experience + cup_distance + board_size + trial +
(1 + delay + trial | site/subject/block/hiding_location ) +
(1 + task_experience + cup_distance + board_size + trial + delay | species)
In the dataframe, subject_site = subject, and norm_age should be used for age.
Model as pre-registered has too many random effects
Error: number of observations (=6246) < number of random effects (=10608) for term (1 + delay + trial | hiding_location:(block:(subject_site:site))); the random-effects parameters are probably unidentifiable
Pruning random effects in the following order (from preregistration):
- Remove correlations between random effects
- Remove random slopes (in the following order)
specieshiding_locationblocksubject
Model only converges once we take out hiding_location. After doing so, the other random effects (correlation, site, species) can be put back in.
The model below converges. Model output is saved in 06_mp_model_v2.rds
correct ~ delay * norm_age +
task_experience + cup_distance + board_size + trial +
(1 + delay + trial | site/subject_site/block) +
(1 + task_experience + cup_distance + board_size + trial + delay | species)
After pruning random effects with little variability and removing board_size, which covaried with cup_distance, the reduced model has the following structure. It is saved in 06_mp_3_model3_v2.rds
correct ~ delay * norm_age +
task_experience + cup_distance + trial +
(1 + delay | site/subject_site) +
(1 + delay | species)
Data import
mp_data <- read.csv("../data/merged_data/01_manyprimates_pilot_merged_data_v2.csv")
Prepare code for pre-registered mixed modeling
cup_distance, board_size and trialmodel.data <- mp_data %>%
filter(species != "black_faced_spider_monkey") %>%
mutate_at(vars(cup_distance, board_size, trial), funs(scale(.)[,1])) %>%
mutate(hiding_location = factor(hiding_location),
delay = fct_relevel(delay, "short"))
The model takes a while to run. Run next line to load model output from previous run with structure below.
# mm.1 <- readRDS("06_mp_model.rds")
mm.1 <- readRDS("06_mp_model_v2.rds")
mm.1 = glmer(correct ~ delay * norm_age +
task_experience + cup_distance + board_size + trial +
(1 + delay + trial | site/subject_site/block) +
(1 + task_experience + cup_distance + board_size + trial + delay | species)
, data = model.data
, family = binomial
, control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5))
)
saveRDS(mm.1, "06_mp_model_v2.rds")
Some diagnostics
theta <- getME(mm.1, "theta")
diag.element <- getME(mm.1, "lower") == 0
any(theta[diag.element] < 1e-5)
[1] TRUE
Confirm model structure
# mm.1@call
formula(mm.1)
correct ~ delay * norm_age + task_experience + cup_distance +
board_size + trial + (1 + delay + trial | site/subject_site/block) +
(1 + task_experience + cup_distance + board_size + trial +
delay | species)
glance(mm.1) %>% kable(digits = 2)
| sigma | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|
| 1 | -3385.22 | 6906.44 | 7364.74 | 6364.34 | 6178 |
fmt = function(num, digits) return(round(num, digits))
VarCorr(mm.1) %>% print(formatter = fmt, digits = 3) # comp = c("Variance", "Std.Dev.")
Groups Name Std.Dev. Corr
block:(subject_site:site) (Intercept) 0
delaylong 0 NaN
delaymedium 0 NaN 0.95
trial 0 NaN 0.76 0.91
subject_site:site (Intercept) 0.861
delaylong 0.653 -0.91
delaymedium 0.533 -0.85 0.99
trial 0.093 -0.25 0.63 0.72
site (Intercept) 0.849
delaylong 0.54 -1.00
delaymedium 0.652 -0.99 1.00
trial 0.092 0.86 -0.81 -0.81
species (Intercept) 0.532
task_experienceyes 0.066 1.00
cup_distance 0.025 -1.00 -1.00
board_size 0.112 -1.00 -1.00 1.00
trial 0.005 -1.00 -1.00 1.00 1.00
delaylong 0.449 -1.00 -1.00 1.00 1.00 1.00
delaymedium 0.388 -1.00 -1.00 1.00 1.00 1.00 1.00
CIs
mm.1.ci = confint(mm.1, method = 'Wald') %>% # bootstrap these later
as.data.frame %>%
rownames_to_column %>%
filter(complete.cases(.)) %>%
rename(LL = `2.5 %`, UL = `97.5 %`) %>%
mutate(OR_LL = exp(LL), OR_UL = exp(UL))
coef(summary(mm.1)) %>%
as.data.frame %>%
rownames_to_column() %>%
mutate(OR = exp(Estimate)) %>%
left_join(mm.1.ci, by = 'rowname') %>%
select(rowname, OR, OR_LL, OR_UL, Estimate, LL, UL, everything()) %>%
kable(digits = 3)
| rowname | OR | OR_LL | OR_UL | Estimate | LL | UL | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 5.243 | 2.440 | 11.266 | 1.657 | 0.892 | 2.422 | 0.390 | 4.245 | 0.000 |
| delaylong | 0.262 | 0.150 | 0.457 | -1.340 | -1.897 | -0.783 | 0.284 | -4.716 | 0.000 |
| delaymedium | 0.340 | 0.190 | 0.608 | -1.080 | -1.663 | -0.497 | 0.297 | -3.632 | 0.000 |
| norm_age | 1.021 | 0.824 | 1.265 | 0.021 | -0.194 | 0.235 | 0.109 | 0.188 | 0.851 |
| task_experienceyes | 1.019 | 0.714 | 1.454 | 0.019 | -0.337 | 0.375 | 0.181 | 0.105 | 0.917 |
| cup_distance | 1.997 | 1.550 | 2.573 | 0.692 | 0.438 | 0.945 | 0.129 | 5.352 | 0.000 |
| board_size | 1.278 | 0.983 | 1.662 | 0.245 | -0.017 | 0.508 | 0.134 | 1.831 | 0.067 |
| trial | 1.029 | 0.922 | 1.148 | 0.029 | -0.081 | 0.138 | 0.056 | 0.510 | 0.610 |
| delaylong:norm_age | 1.015 | 0.820 | 1.256 | 0.015 | -0.198 | 0.228 | 0.109 | 0.136 | 0.892 |
| delaymedium:norm_age | 1.058 | 0.859 | 1.304 | 0.056 | -0.152 | 0.265 | 0.106 | 0.530 | 0.596 |
corr = cov2cor(vcov(mm.1)) %>% as.matrix %>% round(2)
corr[upper.tri(corr, diag = T)] = ''
colnames(corr) = 1:10
rownames(corr) = str_c(1:10, ' ', rownames(corr))
corr %>% as.data.frame %>% select(-10) %>% rownames_to_column
based on estimated marginal means
Note. This wasn’t in the preregistration.
emmeans(mm.1, pairwise ~ delay, type = 'response')$contrasts
contrast odds.ratio SE df z.ratio p.value
short / long 3.8183950 1.08483113 Inf 4.716 <.0001
short / medium 2.9435164 0.87514908 Inf 3.631 0.0008
long / medium 0.7708779 0.07336213 Inf -2.734 0.0172
Results are averaged over the levels of: task_experience
P value adjustment: tukey method for comparing a family of 3 estimates
Tests are performed on the log odds ratio scale
plot_model(mm.1, title = 'Fixed Effects', order.terms = c(7, 4, 3:1, 9:8, 5, 6), width = .3,
show.values = T, value.size = 2.5, value.offset = .3) +
geom_hline(yintercept = 1, lty = 2) +
ylim(0, 3)
ranef.plots = plot_model(mm.1, type = 're', sort.est = '(Intercept)')
In line with the model summary above, there’s essentially zero variability in the random effects estimates for this.
ranef.plots[[1]]
ranef.plots[[2]]
ranef.plots[[3]]
ranef.plots[[4]]
block from random effects as the estimates in the previous models were essentially 0trial random slopes within speciescorrect ~ delay * norm_age +
task_experience + cup_distance + board_size + trial +
(1 + delay + trial | site/subject_site ) +
(1 + task_experience + cup_distance + board_size + delay | species)
col.mm1 <- glm(correct ~ delay + norm_age +
task_experience + cup_distance + board_size + trial
, data = model.data
, family = binomial)
vif(col.mm1)
GVIF Df GVIF^(1/(2*Df))
delay 1.008742 2 1.002178
norm_age 1.111308 1 1.054186
task_experience 1.048604 1 1.024013
cup_distance 2.068924 1 1.438375
board_size 1.945906 1 1.394957
trial 1.000394 1 1.000197
board_size and cup_distance show high colinearity
Remove board_size as it is highly correlated with cup_distance. Cup distance seems to be of more immediate relevance.
correct ~ delay * norm_age +
task_experience + cup_distance + trial +
(1 + delay + trial | site/subject_site ) +
(1 + task_experience + cup_distance + delay | species)
Check how many different levels there are within each random effect
source("diagnostic_fcns.r")
Overview = fe.re.tab("correct ~ delay + task_experience + cup_distance + trial", "species", data = model.data)
Overview$summary
$`delay_within_species (factor)`
3
11
$`task_experience_within_species (factor)`
1 2
9 2
$`cup_distance_within_species (covariate)`
1 2 4
7 3 1
$`trial_within_species (covariate)`
36
11
This suggests that, within species, random slopes for task_experience does not make much sense as most species have only 1 level. Same is true for cup_distance. Indeed, the model summary and random effects plot for species confirm that there is little variability in these estimates (they’re close to zero). Therefore they are removed.
correct ~ delay * norm_age +
task_experience + cup_distance + trial +
(1 + delay + trial | site/subject_site ) +
(1 + delay | species)
The model takes a while to run. Run next line to load model output from previous run with structure below.
mm.2 <- readRDS("06_2_mp_model2_v2.rds")
# mm.2.ci<- readRDS("06_2_mp_model2_ci_v2.rds")
mm.2 <- glmer(correct ~ delay * norm_age +
task_experience + cup_distance + trial +
(1 + trial + delay | site / subject_site ) +
(1 + delay | species)
, data = model.data
, family = binomial
, control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5))
)
saveRDS(mm.2, "06_2_mp_model2_v2.rds")
Remove trial from the random slopes for subject/site as it’s near zero both in mm.1 and even more so in mm.2
VarCorr(mm.2) %>% print(comp = c("Variance", "Std.Dev."), formatter = fmt, digits = 3)
Groups Name Variance Std.Dev. Corr
subject_site:site (Intercept) 0.756 0.869
trial 0.008 0.092 -0.25
delaylong 0.43 0.656 -0.91 0.63
delaymedium 0.288 0.536 -0.84 0.74 0.99
site (Intercept) 0.936 0.968
trial 0.008 0.09 0.86
delaylong 0.4 0.632 -1.00 -0.81
delaymedium 0.499 0.707 -0.99 -0.79 1.00
species (Intercept) 0.353 0.594
delaylong 0.19 0.436 -1.00
delaymedium 0.132 0.363 -1.00 1.00
plot_model(mm.2, type = 're', sort.est = '(Intercept)')[[1]]
plot_model(mm.2, type = 're', sort.est = '(Intercept)')[[2]]
The model takes a while to run. Run next line to load model output from previous run with structure below.
mm.3 <- readRDS("06_3_mp_model3_v2.rds")
mm.3 <- glmer(correct ~ delay * norm_age +
task_experience + cup_distance + trial +
(1 + delay | site / subject_site ) +
(1 + delay | species)
, data = model.data
, family = binomial
, control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5))
)
saveRDS(mm.3, "06_3_mp_model3_v2.rds")
Confirm model structure
formula(mm.3)
correct ~ delay * norm_age + task_experience + cup_distance +
trial + (1 + delay | site/subject_site) + (1 + delay | species)
glance(mm.3) %>% kable(digits = 2)
| sigma | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|
| 1 | -3391.12 | 6836.24 | 7018.21 | 6386.6 | 6219 |
VarCorr(mm.3) %>% print(comp = c("Variance", "Std.Dev."), formatter = fmt, digits = 3)
Groups Name Variance Std.Dev. Corr
subject_site:site (Intercept) 0.752 0.867
delaylong 0.427 0.654 -0.90
delaymedium 0.27 0.519 -0.85 0.99
site (Intercept) 1.018 1.009
delaylong 0.472 0.687 -1.00
delaymedium 0.547 0.74 -1.00 1.00
species (Intercept) 0.345 0.587
delaylong 0.19 0.436 -1.00
delaymedium 0.134 0.367 -1.00 1.00
CIs
# this is not currently run
source("boot_glmm.r")
mm.3.ci = boot.glmm.pred(model.res=mm.3, excl.warnings=F, nboots=1000, para=F, resol=100, level=0.95, use=NULL, circ.var.name=NULL, circ.var=NULL, use.u=F,n.cores=c("all-1", "all"), save.path=NULL)
saveRDS(mm.3.ci, "06_3_mp_model3_ci_v2.rds")
mm.3.ci = confint(mm.3, method = 'Wald') %>% # bootstrap these later
as.data.frame %>%
rownames_to_column %>%
filter(complete.cases(.)) %>%
rename(LL = `2.5 %`, UL = `97.5 %`) %>%
mutate(OR_LL = exp(LL), OR_UL = exp(UL))
coef(summary(mm.3)) %>%
as.data.frame %>%
rownames_to_column() %>%
mutate(OR = exp(Estimate)) %>%
left_join(mm.3.ci, by = 'rowname') %>%
select(rowname, OR, OR_LL, OR_UL, Estimate, LL, UL, everything()) %>%
kable(digits = 3)
| rowname | OR | OR_LL | OR_UL | Estimate | LL | UL | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 6.693 | 2.920 | 15.343 | 1.901 | 1.072 | 2.731 | 0.423 | 4.492 | 0.000 |
| delaylong | 0.229 | 0.127 | 0.412 | -1.476 | -2.066 | -0.886 | 0.301 | -4.905 | 0.000 |
| delaymedium | 0.298 | 0.166 | 0.537 | -1.210 | -1.797 | -0.623 | 0.299 | -4.039 | 0.000 |
| norm_age | 1.023 | 0.824 | 1.269 | 0.022 | -0.194 | 0.238 | 0.110 | 0.203 | 0.839 |
| task_experienceyes | 0.943 | 0.701 | 1.269 | -0.058 | -0.355 | 0.238 | 0.151 | -0.385 | 0.700 |
| cup_distance | 2.112 | 1.699 | 2.624 | 0.747 | 0.530 | 0.965 | 0.111 | 6.743 | 0.000 |
| trial | 1.027 | 0.968 | 1.090 | 0.027 | -0.033 | 0.086 | 0.030 | 0.874 | 0.382 |
| delaylong:norm_age | 1.009 | 0.817 | 1.246 | 0.009 | -0.202 | 0.220 | 0.108 | 0.083 | 0.934 |
| delaymedium:norm_age | 1.050 | 0.857 | 1.288 | 0.049 | -0.155 | 0.253 | 0.104 | 0.474 | 0.636 |
based on estimated marginal means
emmeans(mm.3, pairwise ~ delay, type = 'response')$contrasts
contrast odds.ratio SE df z.ratio p.value
short / long 4.3746081 1.31625922 Inf 4.905 <.0001
short / medium 3.3515380 1.00366654 Inf 4.039 0.0002
long / medium 0.7661345 0.06572278 Inf -3.105 0.0054
Results are averaged over the levels of: task_experience
P value adjustment: tukey method for comparing a family of 3 estimates
Tests are performed on the log odds ratio scale
plot_model(mm.3, title = "Fixed Effects", order.terms = c(6, 4, 3:1, 8:7, 5), width = .3,
show.values = T, value.size = 2.5, value.offset = .3) +
geom_hline(yintercept = 1, lty = 2) +
ylim(.05, 3)
ggsave('../graphs/05_forestplot.png', width = 4, height = 2.5, scale = 2)
ranef.plots2 = plot_model(mm.3, type = 're', sort.est = '(Intercept)')
ranef.plots2[[1]]
ranef.plots2[[2]]
ranef.plots2[[3]]
mm.4 <- glmer(correct ~ delay * norm_age +
task_experience + cup_distance + trial +
(1 + delay | species)
, data = model.data
, family = binomial
, control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5))
)
We’re looking for the lowest AIC(c) as the model with the ‘best fit’ with a reasonable number of parameters. (Too many are penalized by AIC as one way to address overfitting.)
Indeed, the reduced model seems to do a better job of striking that balance between fitting the data with fewer parameters.
AICctab(mm.1, mm.2, mm.3, mm.4, logLik = T, weights = T)
dLogLik dAICc df weight
mm.3 83.7 0.0 27 0.9961
mm.2 86.2 11.1 35 0.0039
mm.1 89.6 71.5 68 <0.001
mm.4 0.0 143.2 15 <0.001
anova(mm.1, mm.2, mm.3, mm.4)
Data: model.data
Models:
mm.4: correct ~ delay * norm_age + task_experience + cup_distance +
mm.4: trial + (1 + delay | species)
mm.3: correct ~ delay * norm_age + task_experience + cup_distance +
mm.3: trial + (1 + delay | site/subject_site) + (1 + delay | species)
mm.2: correct ~ delay * norm_age + task_experience + cup_distance +
mm.2: trial + (1 + trial + delay | site/subject_site) + (1 + delay |
mm.2: species)
mm.1: correct ~ delay * norm_age + task_experience + cup_distance +
mm.1: board_size + trial + (1 + delay + trial | site/subject_site/block) +
mm.1: (1 + task_experience + cup_distance + board_size + trial +
mm.1: delay | species)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
mm.4 15 6979.6 7080.7 -3474.8 6949.6
mm.3 27 6836.2 7018.2 -3391.1 6782.2 167.3967 12 <2e-16 ***
mm.2 35 6847.1 7083.0 -3388.6 6777.1 5.0907 8 0.7478
mm.1 68 6906.4 7364.7 -3385.2 6770.4 6.7083 33 1.0000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Difference
coef1 = coef(summary(mm.1))[c(2,3,6), 1]
coef2 = coef(summary(mm.3))[c(2,3,6), 1]
coef2 - coef1
delaylong delaymedium cup_distance
-0.1359696 -0.1297933 0.0556988
Difference in odds ratios
exp(coef2) - exp(coef1)
delaylong delaymedium cup_distance
-0.03329287 -0.04134608 0.11440274